 reliability score




Geometric Calibration and Neutral Zones for Uncertainty-Aware Multi-Class Classification

Das, Soumojit, Dasgupta, Nairanjana, Dutta, Prashanta

arXiv.org Machine Learning

Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain -- even well-calibrated models provide no mechanism to identify which specific predictions are unreliable. We develop a geometric framework addressing both calibration and instance-level uncertainty quantification for neural network probability outputs. Treating probability vectors as points on the $(c-1)$-dimensional probability simplex equipped with the Fisher--Rao metric, we construct: (i) Additive Log-Ratio (ALR) calibration maps that reduce exactly to Platt scaling for binary problems while extending naturally to multi-class settings, and (ii) geometric reliability scores that translate calibrated probabilities into actionable uncertainty measures, enabling principled deferral of ambiguous predictions to human review. Theoretical contributions include: consistency of the calibration estimator at rate $O_p(n^{-1/2})$ via M-estimation theory (Theorem 1), and tight concentration bounds for reliability scores with explicit sub-Gaussian parameters enabling sample size calculations for validation set design (Theorem 2). We conjecture Neyman--Pearson optimality of our neutral zone construction based on connections to Bhattacharyya coefficients. Empirical validation on Adeno-Associated Virus classification demonstrates that the two-stage framework captures 72.5% of errors while deferring 34.5% of samples, reducing automated decision error rates from 16.8% to 6.9%. Notably, calibration alone yields marginal accuracy gains; the operational benefit arises primarily from the reliability scoring mechanism, which applies to any well-calibrated probability output. This work bridges information geometry and statistical learning, offering formal guarantees for uncertainty-aware classification in applications requiring rigorous validation.
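
The claim that an ALR calibration map reduces to Platt scaling in the binary case can be checked with a small sketch. The abstract does not give the exact form of the map, so the affine map `z -> A z + b` in ALR coordinates below is an assumption, and the function names `alr`, `alr_inverse`, `calibrate`, and `platt` are hypothetical. Platt scaling is taken here in its common form, applied to the logit of the positive-class probability.

```python
import math


def alr(p):
    """Additive log-ratio transform: log(p_i / p_c), last class as reference.

    Assumes every probability is strictly positive.
    """
    ref = p[-1]
    return [math.log(pi / ref) for pi in p[:-1]]


def alr_inverse(z):
    """Map ALR coordinates back to the simplex (softmax with a zero appended)."""
    m = max(0.0, max(z))  # shift for numerical stability
    exps = [math.exp(zi - m) for zi in z] + [math.exp(-m)]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(p, A, b):
    """Assumed calibration map: an affine transform in ALR space."""
    z = alr(p)
    z_cal = [
        sum(A[i][j] * z[j] for j in range(len(z))) + b[i]
        for i in range(len(z))
    ]
    return alr_inverse(z_cal)


def platt(p1, a, b):
    """Platt scaling on the logit of the positive-class probability."""
    logit = math.log(p1 / (1.0 - p1))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))


# Binary case: the ALR coordinate is the logit, so both maps agree.
binary = calibrate([0.8, 0.2], A=[[1.5]], b=[-0.3])
assert abs(binary[0] - platt(0.8, 1.5, -0.3)) < 1e-12

# Three-class case: the output is still a probability vector.
multi = calibrate([0.6, 0.3, 0.1], A=[[1.2, 0.0], [0.0, 0.9]], b=[0.1, -0.2])
assert abs(sum(multi) - 1.0) < 1e-12
```

With two classes the single ALR coordinate is `log(p1 / (1 - p1))`, so the affine map followed by the inverse transform is a sigmoid of `a * logit + b`, which is why the two results coincide.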


Subnational Geocoding of Global Disasters Using Large Language Models

Ronco, Michele, Delforge, Damien, Jäger, Wiebke S., Corbane, Christina

arXiv.org Artificial Intelligence

Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations in unstructured textual form, with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.


A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain

Lu, Yining, Tang, Wenyi, Johnson, Max, Jung, Taeho, Jiang, Meng

arXiv.org Artificial Intelligence

Existing retrieval-augmented generation (RAG) systems typically use a centralized architecture, causing a high cost of data collection, integration, and management, as well as privacy concerns. There is a great need for a decentralized RAG system that enables foundation models to utilize information directly from data owners who maintain full control over their sources. However, decentralization brings a challenge: the numerous independent data sources vary significantly in reliability, which can diminish retrieval accuracy and response quality. To address this, our decentralized RAG system has a novel reliability scoring mechanism that dynamically evaluates each source based on the quality of the responses it helps generate and prioritizes high-quality sources during retrieval. To ensure transparency and trust, the scoring process is securely managed through blockchain-based smart contracts, creating verifiable and tamper-proof reliability records without relying on a central authority. We evaluate our decentralized system with two Llama models (3B and 8B) in two simulated environments where six data sources have different levels of reliability. Our system achieves a +10.7% performance improvement over its centralized counterpart in the real-world-like unreliable data environments. Notably, it approaches the upper-bound performance of centralized systems under ideally reliable data environments. The decentralized infrastructure enables secure and trustworthy scoring management, achieving approximately 56% marginal cost savings through batched update operations. Our code and system are open-sourced at github.com/yining610/Reliable-dRAG.


M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

Sun, Mengzhou, Zhao, Sendong, Chen, Jianyu, Wang, Haochun, Qin, Bin

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems: they generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an accuracy improvement of up to 23.31% across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.


Data Reliability Scoring

Chen, Yiling, Feng, Shi, Kattuman, Paul, Yu, Fang-Yi

arXiv.org Machine Learning

How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we see outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how much reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth-based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment -- a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data demonstrate that the Gram determinant score effectively captures data quality across diverse observation processes.
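
The "volume spanned by vectors" reading of a Gram determinant can be illustrated with a short sketch. The abstract does not say which vectors the score is built from, so this shows only the generic quantity det(V V^T), the squared volume of the parallelepiped spanned by the rows of V. The function name `gram_determinant` and the example vectors are hypothetical and are not the paper's construction.

```python
def gram_determinant(vectors):
    """Return det(V V^T), the squared volume spanned by the rows of V."""
    k = len(vectors)
    # Gram matrix of pairwise inner products.
    gram = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

    # Determinant via Gaussian elimination with partial pivoting.
    det = 1.0
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(gram[r][col]))
        if abs(gram[pivot][col]) < 1e-15:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != col:
            gram[col], gram[pivot] = gram[pivot], gram[col]
            det = -det  # a row swap flips the sign
        det *= gram[col][col]
        for r in range(col + 1, k):
            factor = gram[r][col] / gram[col][col]
            for c in range(col, k):
                gram[r][c] -= factor * gram[col][c]
    return det


# Orthogonal vectors of lengths 3 and 2 span an area of 6, so the result is 36.
assert abs(gram_determinant([[3.0, 0.0], [0.0, 2.0]]) - 36.0) < 1e-9

# Collinear vectors span no area.
assert abs(gram_determinant([[1.0, 2.0], [2.0, 4.0]])) < 1e-9
```

The second example hints at why such a quantity could track data quality: when the describing vectors become nearly collinear, the spanned volume collapses toward zero.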




Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability

Yoo, Seungju, Kwon, Hyuk, Hwang, Joong-Won, Lee, Kibok

arXiv.org Artificial Intelligence

Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at https://github.com/YonseiML/autoeval-det.
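
One common way to measure spatial agreement between two bounding boxes is intersection over union (IoU). The abstract does not say which overlap measure PCR uses, so the sketch below is only an assumption about how a consistency term between a pre-NMS candidate and a post-NMS box might be computed. The `(x1, y1, x2, y2)` box format and the function name `iou` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A retained box and a shifted pre-NMS candidate: overlap 1, union 7.
assert abs(iou((0, 0, 2, 2), (1, 1, 3, 3)) - 1.0 / 7.0) < 1e-12

# Identical boxes agree perfectly; disjoint boxes do not overlap at all.
assert iou((0, 0, 2, 2), (0, 0, 2, 2)) == 1.0
assert iou((0, 0, 1, 1), (2, 2, 3, 3)) == 0.0
```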